What kind of corpus did I choose and why?

The corpus for my portfolio consists of 100 opera songs and 100 musical songs, collected from pre-existing playlists that are available on Spotify. I use Spotify’s Opera 100: Spotify Picks playlist as the basis for my opera playlist. This playlist contains 97 tracks, so I added 3 tracks manually, based on Spotify’s suggestions. For my musical playlist, I use a public playlist called BROADWAY MUSICALS, made by Hugo Torres. Of all the musical playlists I could find, this was the most inclusive. I chose to focus on Broadway musicals because, just like operas, they were written to be performed in front of a live audience. The original playlist contains 150 songs, so I manually removed 50 tracks. I removed tracks from the musicals that had the most tracks in the playlist, in order to create a playlist with as many different musicals as possible.
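The trimming itself was done by hand. As a rough illustration of the same rule (always remove a track from the musical that currently has the most tracks), here is a minimal Python sketch. The function name `trim_playlist`, the choice to drop the last-listed track, and the placeholder titles are assumptions made for this example and are not taken from the real playlist.

```python
from collections import Counter


def trim_playlist(tracks, target_size):
    """Drop tracks from the most represented shows until target_size remain.

    `tracks` is a list of (title, show) tuples; the original order is kept.
    """
    kept = list(tracks)
    while len(kept) > target_size:
        counts = Counter(show for _, show in kept)
        # Show with the most tracks left (ties go to the show listed first).
        biggest_show, _ = counts.most_common(1)[0]
        # Assumption: remove the last-listed track of that show.
        for i in range(len(kept) - 1, -1, -1):
            if kept[i][1] == biggest_show:
                del kept[i]
                break
    return kept


# Placeholder data, not the real playlist.
playlist = [
    ("Track A1", "Show A"), ("Track A2", "Show A"), ("Track A3", "Show A"),
    ("Track B1", "Show B"), ("Track B2", "Show B"),
    ("Track C1", "Show C"),
]
trimmed = trim_playlist(playlist, 4)
```

Because a track is only ever removed from the largest remaining show, no musical disappears from the playlist while another still has more tracks.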
I chose this corpus because I have always been fascinated by musicals. More recently, I was introduced to opera and recognized the same compelling drama I appreciate in musicals. Opera songs and musical songs both mainly serve to tell a story, but they have very different styles. Because both have such a strong narrative function, I wonder whether the music from operas and musicals shares certain characteristics.

Natural comparison points
In comparing opera tracks with musical tracks, I expect to find a difference in tempo and danceability. Furthermore, I wonder if opera songs are sadder than musical songs, which might be reflected in the valence. I am curious to find out if the energy and loudness differ between the groups. I expect the liveness, instrumentalness, and speechiness to be similar, because most songs in the corpus are studio recordings and contain vocals.

Weaknesses of the corpus:
Adding music from every opera and musical would create a very large corpus, especially because most operas and musicals have had many productions with different artists, conductors, and musicians. As a result, some operas and musicals are missing entirely, and my corpus does not cover the whole genre. Furthermore, Spotify’s pre-existing playlists generally include only the well-known (classical) operas and musicals, leaving out smaller productions.

Typical tracks:
Habanera - Carmen: for me, this is a typical opera song: it has very high notes and almost everyone knows it.
La donna è mobile - Rigoletto: again, this is a very famous song. I think the grandeur of this song is typical for opera music.
One Day More - Les Misérables: this song is very dramatic and has multiple singers, which is typical for musical songs.
You Can’t Stop The Beat - Hairspray: the happiness and danceability of this song are typical for musical songs.

Atypical tracks:
Summertime - Porgy and Bess: this is a jazzy song, which is a different genre from most opera songs.
Ride of the Valkyries - Die Walküre: this song has no lyrics, which is atypical for an opera song.
Totally Fucked - Spring Awakening: this comes close to a rock song, which is a different genre than most musical songs.
Land of Lola - Kinky Boots: this song has a strong disco vibe, with more use of electronic instruments than the average musical song.

Are musical songs really happier than opera songs?


I started by plotting valence against energy for both musical and opera tracks. The same plot also shows the danceability and tempo of the songs in my corpus. As I expected, musical tracks cover a wider range of valence and energy than opera tracks. I think this might be because musicals can be written in a variety of styles and genres, whereas operas often share the same style.
I was surprised that the graph shows so little difference between opera tracks; I would have expected at least a little more variety.
Furthermore, as expected, musical songs are generally more danceable than opera songs. It is interesting that energy and valence seem to have a linear relationship in the musical tracks: the more energy a track has, the higher its valence. This would mean that there are not many relaxed (low energy, high valence) or angry (high energy, low valence) musical songs.
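One way to check whether this relationship is really linear, instead of judging it by eye, is to compute the Pearson correlation between energy and valence for each group. The sketch below is only an illustration: the helper `pearson` is written out by hand and the numbers are made up, not taken from my corpus.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up feature values for a handful of musical tracks.
energy = [0.2, 0.4, 0.6, 0.8]
valence = [0.25, 0.35, 0.65, 0.75]
r = pearson(energy, valence)
```

A value of `r` close to 1 would support the impression that valence rises with energy, while a value near 0 would suggest that the pattern in the plot is weaker than it looks.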
I highlighted one outlier in each group. Memory (from the musical Cats) is a particularly sad musical song, while Stizzoso, mio stizzoso, voi fate il borioso (from the opera La Serva Padrona) is a particularly happy opera song (and also turns out to be the most danceable opera song).
Because the opera songs are clustered so closely together, the individual data points are hard to tell apart. That’s why I decided to take a closer look.

A closer look at the opera playlist


In this plot, another outlier stands out: Et maintenant je dois offrir (from the opera Les Huguenots) has the highest energy of all opera songs in the corpus. I manually chose different x and y limits for this plot, so the data points are somewhat clearer now. However, they are still very much clustered together in the low-energy, low-valence corner.
The tempo and danceability features are quite hard to read from these graphs, so I made separate histograms for each feature to get more insight into the two groups.
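The counting behind such a histogram can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the helper `histogram`, the bin width of 20 BPM, and the tempo values are all invented for the example and are not the settings or data used for my plots.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of a bin."""
    counts = Counter(int(v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Made-up tempo values in BPM for each group.
tempo = {
    "opera": [72.0, 85.5, 88.0, 110.2],
    "musical": [96.0, 118.4, 121.0, 128.9, 140.3],
}
# One histogram per group, so the two distributions can be compared.
tempo_hist = {group: histogram(values, 20) for group, values in tempo.items()}
```

Using the same bin width for both groups keeps the two histograms directly comparable. For a feature such as danceability, which ranges from 0 to 1, a much smaller bin width would be needed.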